Diffusion probabilistic models - Score matching¶

Author : Philippe Esling (esling@ircam.fr)¶

This notebook explores the new class of generative models based on diffusion probabilistic models [ 1 ] . This class of models is inspired by considerations from thermodynamics [ 2 ] , but also bears strong ressemblence to denoising score matching [ 3 ] , Langevin dynamics and autoregressive decoding. We will also discuss the more recent development of denoising diffusion implicit models [ 4 ] , which bypass the need for a Markov chain to accelerate the sampling. Stemming from this work, we will also discuss the wavegrad model [ 5 ] , which is based on the same core principles but applies this class of models for audio data.

In order to fully understand the inner workings of diffusion model, we will review all of the correlated topics. Therefore, we split the explanation between four detailed notebooks.

  1. Score matching and Langevin dynamics.
  2. Diffusion probabilistic models and denoising
  3. Applications to waveforms with WaveGrad
  4. Implicit models to accelerate inference

Score matching¶

In this section we start by reviewing all the foundations on score matching that can lead to fully understand diffusion models. To do so, we will work on the traditional swiss roll dataset.

In [ ]:
%matplotlib inline
import sys
import os
sys.path.append(os.path.dirname(os.getcwd()))
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_swiss_roll
from helper_plot import hdr_plot_style
hdr_plot_style()
# Sample a batch from the swiss roll
def sample_batch(size, noise=1.0):
    x, _= make_swiss_roll(size, noise=noise)
    return x[:, [0, 2]] / 10.0
# Plot it
data = sample_batch(10**4).T
plt.figure(figsize=(16, 12))
plt.scatter(*data, alpha=0.5, color='red', edgecolor='white', s=40)
Out[ ]:
<matplotlib.collections.PathCollection at 0x28cb5d900>
No description has been provided for this image

The idea of score matching was originally proposed by Hyvarinen et al. [ 6 ] . Instead of learning directly the probability of the data $\log p(\mathbf{x})$, we rather aim to learn the gradients of $\log p(\mathbf{x})$ with respect to $x$. In this case, the gradients $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ are termed the score of the density $p(\mathbf{x})$, hence the name score matching. This can be understood as learning the direction of highest probability at each point in the input space. Therefore, when the model is trained, we can improve a sample $x$ by moving it along the directions of highest probability.

However, to perform training, we would need to minimize the error of our model $\mathcal{F}_{\theta}(\mathbf{x})$ at predicting the gradient $\nabla_{\mathbf{x}} \log p(\mathbf{x})$, which amounts to minimize the Fisher divergence, or simply the MSE

$$ \mathcal{L}_{mse} = E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \left\lVert \mathcal{F}_{\theta}(\mathbf{x}) - \nabla_{\mathbf{x}} \log p(\mathbf{x}) \right\lVert_2^2 \right] $$

The real $\nabla_{\mathbf{x}} \log p(\mathbf{x})$ is usually unknown, but under regularity assumptions on $p(\mathbf{x})$, it can be shown that the minimum of $\mathcal{L}_{mse}$ can be found through a tractable objective

$$ \mathcal{L}_{matching} = E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \text{ tr}\left( \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \right) + \frac{1}{2} \left\Vert \mathcal{F}_{\theta}(\mathbf{x}) \right\lVert_2^2 \right] , $$

where $\nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x})$ denotes the Jacobian of $\mathcal{F}_{\theta}(\mathbf{x})$ with respect to $\mathbf{x}$, and $ \text{tr}(\cdot) $ is the trace operation.

To perform this optimization, we will rely on Pytorch and define $\mathcal{F}_{\theta}(\mathbf{x})$ as being a neural network

In [ ]:
import torch
import torch.nn as nn
import torch.optim as optim
# Our approximation model
model = nn.Sequential(
    nn.Linear(2, 128), nn.Softplus(),
    nn.Linear(128, 128), nn.Softplus(),
    nn.Linear(128, 2)
)
# Create ADAM optimizer over our model
optimizer = optim.Adam(model.parameters(), lr=1e-3)

Next, we need to define the loss function for the score matching objective. First to compute the Jacobian, we need a specific (and differentiable) function. (This efficient implementation is based on a discussion that can be found here)

In [ ]:
import torch.autograd as autograd

def jacobian(f, x):
    """Computes the Jacobian of f w.r.t x.
    :param f: function R^N -> R^N
    :param x: torch.tensor of shape [B, N]
    :return: Jacobian matrix (torch.tensor) of shape [B, N, N]
    """
    B, N = x.shape
    y = f(x)
    jacobian = list()
    for i in range(N):
        v = torch.zeros_like(y)
        v[:, i] = 1.
        dy_i_dx = autograd.grad(y, x, grad_outputs=v, retain_graph=True, create_graph=True, allow_unused=True)[0]  # shape [B, N]
        jacobian.append(dy_i_dx)
    jacobian = torch.stack(jacobian, dim=2).requires_grad_()
    return jacobian

The actual score matching loss function is split between the norm of the model output $\frac{1}{2} \left\Vert \mathcal{F}_{\theta}(\mathbf{x}) \right\lVert_2^2$, which we compute first. Then, we compute the trace of the Jacobian loss $\text{ tr}\left( \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \right)$ and return the sum as the full loss.

In [ ]:
def score_matching(model, samples, train=False):
    samples.requires_grad_(True)
    logp = model(samples)
    # Compute the norm loss
    norm_loss = torch.norm(logp, dim=-1) ** 2 / 2.
    # Compute the Jacobian loss
    jacob_mat = jacobian(model, samples)
    tr_jacobian_loss = torch.diagonal(jacob_mat, dim1=-2, dim2=-1).sum(-1)
    return (tr_jacobian_loss + norm_loss).mean(-1)

Finally, we can run our code to train the model

In [ ]:
dataset = torch.tensor(data.T).float()
for t in range(2000):
    # Compute the loss.
    loss = score_matching(model, dataset)
    # Before the backward pass, zero all of the network gradients
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    # Calling the step function to update the parameters
    optimizer.step()
    if ((t % 500) == 0):
        print(loss)
tensor(0.4140, grad_fn=<MeanBackward1>)
tensor(-13.6779, grad_fn=<MeanBackward1>)
tensor(-38.8591, grad_fn=<MeanBackward1>)
tensor(-48.4584, grad_fn=<MeanBackward1>)

We can observe that our model has learned to represent $\mathcal{F}_{\theta}(\mathbf{x}) \approx \nabla_x \log p(x)$ by plotting the output value across the input space

In [ ]:
def plot_gradients(model, data, plot_scatter=True):
    xx = np.stack(np.meshgrid(np.linspace(-1.5, 2.0, 50), np.linspace(-1.5, 2.0, 50)), axis=-1).reshape(-1, 2)
    scores = model(torch.tensor(xx).float()).detach()
    scores_norm = np.linalg.norm(scores, axis=-1, ord=2, keepdims=True)
    scores_log1p = scores / (scores_norm + 1e-9) * np.log1p(scores_norm)
    # Perform the plots
    plt.figure(figsize=(16,12))
    if (plot_scatter):
        plt.scatter(*data, alpha=0.3, color='red', edgecolor='white', s=40)
    plt.quiver(*xx.T, *scores_log1p.T, width=0.002, color='white')
    plt.xlim(-1.5, 2.0)
    plt.ylim(-1.5, 2.0)
plot_gradients(model, data)
No description has been provided for this image

Langevin dynamics¶

After training, our model is able to produce an approximation of the gradient of the probabiliity, such that $\mathcal{F}_{\theta}(\mathbf{x}) \approx \nabla_x \log p(x)$. Therefore, we could use this to generate data by relying on a simple gradient ascent from a given point by using an initial sample $\mathbf{x}_{0} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, and then using the gradient information to find a local maximum of $p(\mathbf{x})$

$$\mathbf{x}_{t + 1} = \mathbf{x}_t + \epsilon \nabla_{\mathbf{x}_t} log p(\mathbf{x}_t)$$

where $\epsilon$ defines the size of the step we take in the direction of the gradient (akin to the learning rate).

In [ ]:
def sample_simple(model, x, n_steps=20, eps=1e-3):
    x_sequence = [x.unsqueeze(0)]
    for s in range(n_steps):
        x = x + eps * model(x)
        x_sequence.append(x.unsqueeze(0))
    return torch.cat(x_sequence)

x = torch.Tensor([1.5, -1.5])
samples = sample_simple(model, x).detach()
plot_gradients(model, data)
plt.scatter(samples[:, 0], samples[:, 1], color='green', edgecolor='white', s=150)
# draw arrows for each  step
deltas = (samples[1:] - samples[:-1])
deltas = deltas - deltas / np.linalg.norm(deltas, keepdims=True, axis=-1) * 0.04
for i, arrow in enumerate(deltas):
    plt.arrow(samples[i,0], samples[i,1], arrow[0], arrow[1], width=1e-4, head_width=2e-2, color="green", linewidth=3)
No description has been provided for this image

However, the previous procedure does not produce a true sample from $\mathbf{x} \sim p(\mathbf{x})$. In order to obtain such a sample, we can rely on a special case of Langevin dynamics. In this case, Langevin dynamics can produce true samples from the density $p(\mathbf{x})$, by relying only on $\nabla_{\mathbf{x}} \log p(\mathbf{x})$. The sampling is defined in a way very similar to MCMC approaches, by applying recursively

$$\mathbf{x}_{t + 1} = \mathbf{x}_t + \frac{\epsilon}{2} \nabla_{\mathbf{x}_t} log p(\mathbf{x}_t) + \sqrt{\epsilon} \mathbf{z}_{t}$$

where $\mathbf{z}_{t}\sim \mathcal{N}(\mathbf{0},\mathbf{I})$. It has been shown in Welling et al. (2011) that under $\epsilon \rightarrow 0, t \rightarrow \inf$: $\mathbf{x}_t$ converges to an exact sample from $p(\mathbf{x})$. This is a key idea behind the score-based generative modeling approach.

In order to implement this sampling procedure, we can once again start from $\mathbf{x}_{0} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$, and progressively anneal $\epsilon \rightarrow 0$ at each step, to obtain true samples from $p(\mathbf{x})$.

In [ ]:
def sample_langevin(model, x, n_steps=10, eps=1e-2, decay=.9, temperature=1.0):
    x_sequence = [x.unsqueeze(0)]
    for s in range(n_steps):
        z_t = torch.rand(x.size())
        x = x + (eps / 2) * model(x) + (np.sqrt(eps) * temperature * z_t)
        x_sequence.append(x.unsqueeze(0))
        eps *= decay
    return torch.cat(x_sequence)

x = torch.Tensor([1.5, -1.5])
samples = sample_langevin(model, x).detach()
plot_gradients(model, data)
plt.scatter(samples[:, 0], samples[:, 1], color='green', edgecolor='white', s=150)
# draw arrows for each mcmc step
deltas = (samples[1:] - samples[:-1])
deltas = deltas - deltas / np.linalg.norm(deltas, keepdims=True, axis=-1) * 0.04
for i, arrow in enumerate(deltas):
    plt.arrow(samples[i,0], samples[i,1], arrow[0], arrow[1], width=1e-4, head_width=2e-2, color="green", linewidth=3)
No description has been provided for this image

Sliced score matching¶

The previously defined score matching with this loss is not scalable to high-dimensional data, nor deep networks, because of the computation of $\text{ tr}\left( \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \right)$. Indeed, the computation of the Jacobian is a $O(N^2 + N)$ operation, thus not being suitable for high-dimensional problems, even with the optimized solution proposed in the previous code. More recently, Song et al. [ 7 ] proposed to use random projections to approximate the computation of $\text{ tr}\left( \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \right)$ in score matching. This approach called Sliced Score Matching allows to replace the optimization objective with

$$ E_{\mathbf{v} \sim \mathcal{N}(0, 1)} E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \mathbf{v}^T \nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x}) \mathbf{v} + \frac{1}{2} \left\Vert \mathbf{v}^T \mathcal{F}_{\theta}(\mathbf{x}) \right\lVert_2^2 \right] , $$

where $\mathbf{v} \sim \mathcal{N}(0, 1)$ are a set of Normal-distributed vectors. They show that this can be computed by using forward mode auto-differentiation, which is computationally efficient. This loss can be implemented as follows (where we compute once again the norm first, and then the Jacobian loss)

In [ ]:
def sliced_score_matching(model, samples):
    samples.requires_grad_(True)
    # Construct random vectors
    vectors = torch.randn_like(samples)
    vectors = vectors / torch.norm(vectors, dim=-1, keepdim=True)
    # Compute the optimized vector-product jacobian
    logp, jvp = autograd.functional.jvp(model, samples, vectors, create_graph=True)
    # Compute the norm loss
    norm_loss = (logp * vectors) ** 2 / 2.
    # Compute the Jacobian loss
    v_jvp = jvp * vectors
    jacob_loss = v_jvp
    loss = jacob_loss + norm_loss
    return loss.mean(-1).mean(-1)

As previously, we can perform a simple optimization of this loss given a set of examples.

In [ ]:
# Our approximation model
model = nn.Sequential(
    nn.Linear(2, 128), nn.Softplus(),
    nn.Linear(128, 128), nn.Softplus(),
    nn.Linear(128, 2)
)
# Create ADAM optimizer over our model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
dataset = torch.tensor(data.T)[:1000].float()
for t in range(2000):
    # Compute the loss.
    loss = sliced_score_matching(model, dataset)
    # Before the backward pass, zero all of the network gradients
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    # Calling the step function to update the parameters
    optimizer.step()
    # Print loss
    if ((t % 500) == 0):
        print(loss)
tensor(0.0436, grad_fn=<MeanBackward1>)
tensor(-1.1006, grad_fn=<MeanBackward1>)
tensor(-6.4248, grad_fn=<MeanBackward1>)
tensor(-9.4135, grad_fn=<MeanBackward1>)

We can check our approximation by relying on the same function as before.

In [ ]:
plot_gradients(model, data)
No description has been provided for this image

Denoising score matching¶

Originally, the notion of denoising score matching was discussed by Vincent [ 3 ] in the context of denoising auto-encoders. In that case, this allows to completely remove the use of $\nabla_{\mathbf{x}} \mathcal{F}_{\theta}(\mathbf{x})$ in the computation of score matching. To do so, we can first corrupt the input point $\mathbf{x}$ with a given noise vector, leading to a distribution $q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})$. Then, score matching can be used to estimate the score of this perturbed data distribution. It has been shown in [3], that the optimal network that approximates $\mathcal{F}_{\theta}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$ can be found by minimizing the following objective

$$ E_{q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})} E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \left\Vert \mathcal{F}_{\theta}(\tilde{\mathbf{x}}) - \nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x}) \right\lVert_2^2 \right] , $$

However, it should be noted that $\mathcal{F}_{\theta}(\mathbf{x}) = \nabla_{\mathbf{x}} \log q_{\sigma}(\mathbf{x}) \approx \nabla_{\mathbf{x}} \log p(\mathbf{x})$ is only true when the noise is small enough to consider that $q_{\sigma}(\mathbf{x}) \approx p(\mathbf{x})$. As it has been shown in [ 3 ] , [ 8 ] , if we choose the noise distribution to be $q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})=\mathcal{N}(\tilde{\mathbf{x}}\mid\mathbf{x}, \sigma^{2}\mathbf{I})$, then we have $\nabla_{\tilde{\mathbf{x}}} \log q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x}) = \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^{2}}$. Therefore, the denoising score matching loss simply becomes

$$ \mathcal{l}(\theta;\sigma) = E_{q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})} E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \left\Vert \mathcal{F}_{\theta}(\tilde{\mathbf{x}}) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^{2}} \right\lVert_2^2 \right] , $$

We can implement the denoising score matching loss as follows

In [ ]:
def denoising_score_matching(scorenet, samples, sigma=0.01):
    perturbed_samples = samples + torch.randn_like(samples) * sigma
    target = - 1 / (sigma ** 2) * (perturbed_samples - samples)
    scores = scorenet(perturbed_samples)
    target = target.view(target.shape[0], -1)
    scores = scores.view(scores.shape[0], -1)
    loss = 1 / 2. * ((scores + target) ** 2).sum(dim=-1).mean(dim=0)
    return loss

We rely on the same model and optimizer as before

In [ ]:
# Our approximation model
model = nn.Sequential(
    nn.Linear(2, 128), nn.Softplus(),
    nn.Linear(128, 128), nn.Softplus(),
    nn.Linear(128, 2)
)
# Create ADAM optimizer over our model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
dataset = torch.tensor(data.T).float()
for t in range(2000):
    # Compute the loss.
    loss = denoising_score_matching(model, dataset)
    # Before the backward pass, zero all of the network gradients
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    # Calling the step function to update the parameters
    optimizer.step()
    # Print loss
    if ((t % 1000) == 0):
        print(loss)
tensor(10007.4111, grad_fn=<MulBackward0>)
tensor(10055.4170, grad_fn=<MulBackward0>)
In [ ]:
plot_gradients(model, data)
No description has been provided for this image

Noise conditional score networks¶

In a recent paper, Song and Ermon [ 8 ] built on these ideas to develop a score-based generative framework called Noise-Conditional Score Networks (NCSN). They underlined several flaws in the existing score matching objectives. First, they showed that the use of (sliced) score matching without noise was inconsistent when the data followed the manifold hypothesis, ie. the noiseless objectives were only consistent when the distribution spanned the whole space. Second, they showed that low density regions could cause difficulties for both score matching and sampling with Langevin dynamics.

To address these issues, they propose to rely on perturbed data to bypass the manifold issues, but also to simultaneously estimate scores corresponding to all noise levels by training a single conditional score network. To do so, we consider a positive geometric sequence of noise variances $\{\sigma_{i}\}_{i=1}^{L}$, choosing $\sigma_{1}$ to be large enough to mitigate manifold issue, and satisfying $\frac{\sigma_{1}}{\sigma_{2}} = \cdots = \frac{\sigma_{L-1}}{\sigma_{L}} > 1$. The goal is to train a conditional network to estimate the gradients of all perturbed data distributions, ie. $$ \forall \sigma \in \{\sigma_{i}\}_{i=1}^{L}, \mathcal{F}_{\theta}(\tilde{\mathbf{x}}, \sigma) \approx \nabla_{\mathbf{x}} \log q_{\sigma}(\mathbf{x}) $$ The network learning this is called a Noise Conditional Score Network (NCSN). To perform this optimization, we start from the previously-defined denoising score matching objective $$ \mathcal{l}(\theta;\sigma) = E_{q_{\sigma}(\tilde{\mathbf{x}}\mid\mathbf{x})} E_{\mathbf{x} \sim p(\mathbf{x})} \left[ \left\Vert \mathcal{F}_{\theta}(\tilde{\mathbf{x}}, \sigma) + \frac{\tilde{\mathbf{x}} - \mathbf{x}}{\sigma^{2}} \right\lVert_2^2 \right] , $$ As this adress a single noise level, this objective can be combined to obtain a single unified objective as $$ \mathcal{L}(\theta;\{\sigma_{i}\}_{i=1}^{L}) = \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_{i})\mathcal{l}(\theta;\sigma_{i}) $$ where $\lambda(\sigma_{i}) > 0$ is a coefficient function depending on $\sigma_{i}$. This objective can be implemented as follows

In [ ]:
def anneal_dsm_score_estimation(model, samples, labels, sigmas, anneal_power=2.):
    used_sigmas = sigmas[labels].view(samples.shape[0], *([1] * len(samples.shape[1:])))
    perturbed_samples = samples + torch.randn_like(samples) * used_sigmas
    target = - 1 / (used_sigmas ** 2) * (perturbed_samples - samples)
    scores = model(perturbed_samples, labels)
    target = target.view(target.shape[0], -1)
    scores = scores.view(scores.shape[0], -1)
    loss = 1 / 2. * ((scores - target) ** 2).sum(dim=-1) * used_sigmas.squeeze() ** anneal_power
    return loss.mean(dim=0)

As can be seen in the previous equation, this model requires a conditional network $\mathcal{F}_{\theta}(\tilde{\mathbf{x}}, \sigma_{i})$ that also takes as input the different noise levels $\sigma_{i}$. Therefore, we redefine the previous model to account for that

In [ ]:
import torch.nn.functional as F

class ConditionalLinear(nn.Module):
    def __init__(self, num_in, num_out, num_classes):
        super().__init__()
        self.num_out = num_out
        self.lin = nn.Linear(num_in, num_out)
        self.embed = nn.Embedding(num_classes, num_out)
        self.embed.weight.data.uniform_()

    def forward(self, x, y):
        out = self.lin(x)
        gamma = self.embed(y)
        out = gamma.view(-1, self.num_out) * out
        return out
    
    
class ConditionalModel(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.lin1 = ConditionalLinear(2, 128, num_classes)
        self.lin2 = ConditionalLinear(128, 128, num_classes)
        self.lin3 = nn.Linear(128, 2)
    
    def forward(self, x, y):
        x = F.softplus(self.lin1(x, y))
        x = F.softplus(self.lin2(x, y))
        return self.lin3(x)

Finally, we can perform the same optimization as before

In [ ]:
sigma_begin = 1
sigma_end = 0.01
num_classes = 4
sigmas = torch.tensor(np.exp(np.linspace(np.log(sigma_begin), np.log(sigma_end), num_classes))).float()
# Our approximation model
model = ConditionalModel(num_classes)
dataset = torch.tensor(data.T).float()
# Create ADAM optimizer over our model
optimizer = optim.Adam(model.parameters(), lr=1e-3)
for t in range(5000):
    # Compute the loss.
    labels = torch.randint(0, len(sigmas), (dataset.shape[0],))
    loss = anneal_dsm_score_estimation(model, dataset, labels, sigmas)
    # Before the backward pass, zero all of the network gradients
    optimizer.zero_grad()
    # Backward pass: compute gradient of the loss with respect to parameters
    loss.backward()
    # Calling the step function to update the parameters
    optimizer.step()
    # Print loss
    if ((t % 1000) == 0):
        print(loss)
tensor(1.0236, grad_fn=<MeanBackward1>)
tensor(0.8097, grad_fn=<MeanBackward1>)
tensor(0.7809, grad_fn=<MeanBackward1>)
tensor(0.7596, grad_fn=<MeanBackward1>)
tensor(0.7753, grad_fn=<MeanBackward1>)
In [ ]:
xx = np.stack(np.meshgrid(np.linspace(-1.5, 2.0, 50), np.linspace(-1.5, 2.0, 50)), axis=-1).reshape(-1, 2)
labels = torch.randint(0, len(sigmas), (xx.shape[0],))
scores = model(torch.tensor(xx).float(), labels).detach()
scores_norm = np.linalg.norm(scores, axis=-1, ord=2, keepdims=True)
scores_log1p = scores / (scores_norm + 1e-9) * np.log1p(scores_norm)
# Perform the plots
plt.figure(figsize=(16,12))
plt.scatter(*data, alpha=0.3, color='red', edgecolor='white', s=40)
plt.quiver(*xx.T, *scores_log1p.T, width=0.002, color='white')
plt.xlim(-1.5, 2.0)
plt.ylim(-1.5, 2.0);
No description has been provided for this image
In [ ]:
xx = np.stack(np.meshgrid(np.linspace(-1.5, 2.0, 50), np.linspace(-1.5, 2.0, 50)), axis=-1).reshape(-1, 2)
labels = torch.ones(xx.shape[0]).long()
scores = model(torch.tensor(xx).float(), labels).detach()
scores_norm = np.linalg.norm(scores, axis=-1, ord=2, keepdims=True)
scores_log1p = scores / (scores_norm + 1e-9) * np.log1p(scores_norm)
# Perform the plots
plt.figure(figsize=(16,12))
plt.scatter(*data, alpha=0.3, color='red', edgecolor='white', s=40)
plt.quiver(*xx.T, *scores_log1p.T, width=0.002, color='white')
plt.xlim(-1.5, 2.0)
plt.ylim(-1.5, 2.0)
Out[ ]:
(-1.5, 2.0)
No description has been provided for this image

Bibliography¶

[1] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. arXiv preprint arXiv:2006.11239.

[2] Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. arXiv preprint arXiv:1503.03585.

[3] Vincent, P. (2011). A connection between score matching and denoising autoencoders. Neural computation, 23(7), 1661-1674.

[4] Song, J., Meng, C., & Ermon, S. (2020). Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502.

[5] Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., & Chan, W. (2020). WaveGrad: Estimating gradients for waveform generation. arXiv preprint arXiv:2009.00713.

[6] Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr), 695-709.

[7] Song, Y., Garg, S., Shi, J., & Ermon, S. (2020, August). Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence (pp. 574-584). PMLR.

[8] Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems (pp. 11918-11930).

Inspiration and sources¶

https://colab.research.google.com/github/google/jax/blob/master/docs/notebooks/score_matching.ipynb

https://github.com/ermongroup/sliced_score_matching